Question 1

a.)

False; a correct model may still exhibit large sampling variability.

b.)

False; including many nuisance variables increases sampling variability.

c.)

False, \(R^2\) can be inflated with additional X variables, even if they do not add to the model.

d.)

True, since all of these criteria use the SSE in their calculation, either directly or through its logarithm.

e.)

True; the only case where \(SSE_p\) would equal \(PRESS_p\) is when the diagonal of the hat matrix is all zero, which cannot happen in practice since the leverages sum to \(p\).
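As a sketch of this relationship on a small simulated dataset (not the course data; all names here are illustrative), the PRESS statistic can be computed from the ordinary residuals and the hat-matrix diagonal:

```r
# Sketch: PRESS_p >= SSE_p, with the gap driven by the leverages h_ii.
set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

e <- residuals(fit)   # ordinary residuals
h <- hatvalues(fit)   # diagonal of the hat matrix

SSE   <- sum(e^2)
PRESS <- sum((e / (1 - h))^2)  # deleted residuals, squared and summed

SSE < PRESS  # TRUE: PRESS exceeds SSE whenever any h_ii > 0
```

Since the leverages sum to \(p\), they cannot all be zero, so PRESS strictly exceeds SSE for any fitted model.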

f.)

True, since \(BIC\) applies the heavier penalty \(p\log(n)\) (for \(n \geq 8\)), whereas \(AIC\) uses just \(2p\).
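A quick numeric check of the two penalties, using the forms \(AIC = n\log(SSE/n) + 2p\) and \(BIC = n\log(SSE/n) + p\log(n)\) (the exact constants vary by convention; the values below are illustrative):

```r
# BIC's per-parameter penalty log(n) exceeds AIC's 2 once n > e^2 (about 7.39).
n <- 100; p <- 5; SSE <- 42  # illustrative values, not from the assignment
aic <- n * log(SSE / n) + 2 * p
bic <- n * log(SSE / n) + p * log(n)
bic > aic  # TRUE for n >= 8: BIC penalizes model size more heavily
```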

g.)

True, since stepwise regression picks, at each step, the model with the optimal value of the criterion (usually the lowest).

Question 2

##       SSE.mean     
## [1,] 34.876531 7.00
## [2,] 34.621515 6.75
## [3,] 18.712425 6.50
## [4,]  7.229235 6.00
## [5,]  5.559057 5.50
## [6,]  5.013139 5.00
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 8.377e-06 2.141e-05 3.467e-05 4.587e-05 5.618e-05 2.348e-04
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 2.616e-17 8.674e-18 4.441e-16
##           [,1] [,2]
## [1,]  2.058886    2
## [2,]  3.075914    3
## [3,]  4.006368    4
## [4,]  5.890989    6
## [5,]  7.951301    8
## [6,] 10.045522   10

a.)

Yes, there is a correct model: the one with the least model bias, which, as can be seen in the plot, is a model with high order.

b.)

plot(l.order, variance.ave, type='o',xlab="order of model", ylab="variance", main="model variance")

The model variance increases as we fit more complex, higher-order models. We can also see that as the model variance increases, the error variance decreases.

c.)

plot(l.order, bias2.ave, type='o',xlab="order of model", ylab="bias^2", main="squared model bias")

The squared model bias behaves opposite to the model variance: it decreases as the model complexity increases. This parallels the error variance, with both decreasing as model complexity increases.

d.)

Since the error variance behaves very much like the squared bias, the bias is the dominant component; so yes, the answer does depend on the error variance.

e.)

plot(
  l.order,
  err2.mean.ave,
  type = 'o',
  xlab = "order of model",
  ylab = "mean-squared-estimation-error",
  main = "mean-squared-estimation-error"
)

According to the mean-squared-estimation-error, the best model would be the highest-order model with the most terms. The answer does not depend on the error variance, but it does depend on the model variance.
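The decomposition underlying these plots can be verified directly: at a fixed design point, the mean-squared estimation error of \(\hat{f}(x)\) equals squared bias plus variance. A self-contained sketch with a simple simulation (not the assignment's own code; the underfit straight-line model is chosen just to create visible bias):

```r
# Monte Carlo check: E[(fhat - f)^2] = bias^2 + variance at a fixed point.
set.seed(42)
f0 <- 2                      # true value of f at the design point x = 0
nrep <- 5000
fhat <- replicate(nrep, {
  x <- seq(-1, 1, length.out = 20)
  y <- 2 + x^2 + rnorm(20, sd = 0.5)   # true curve: 2 + x^2
  fit <- lm(y ~ x)                      # underfit: straight line
  predict(fit, newdata = data.frame(x = 0))
})
bias2    <- (mean(fhat) - f0)^2
variance <- var(fhat)
mse      <- mean((fhat - f0)^2)
c(mse = mse, bias2.plus.var = bias2 + variance)  # the two agree closely
```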

f.)

The \(E(SSE)\) decreases as the order of the model increases. As the noise increases, the expected value of the SSE grows larger.

plot(
  l.order,
  SSE.mean / (n - l.order - 1),
  type = 'o',
  xlab = "order of model",
  ylab = "E(MSE)",
  main = "E(MSE)"
)
abline(h = sigma.e ^ 2,  lty = 2, col = 2)
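The reference line makes sense because, for a correctly specified model, \(E(SSE) = (n - p)\sigma^2\), so \(E(MSE) = \sigma^2\). A standalone sketch of that fact (simulated data; the names and values are illustrative, not the assignment's):

```r
# Average MSE over many replications approaches sigma^2 for a correct model.
set.seed(7)
sigma <- 1.5
x <- seq(0, 1, length.out = 25)
mse <- replicate(2000, {
  y <- 1 + 3 * x + rnorm(25, sd = sigma)
  fit <- lm(y ~ x)
  sum(residuals(fit)^2) / (25 - 2)   # SSE / (n - p)
})
mean(mse)   # close to sigma^2 = 2.25
```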

Question 3

a.)

# Loading in the data
diab <- read.table("diabetes.txt", header = T)
diab
any(is.na(diab$frame))
## [1] FALSE
diab$frame[which(diab$frame == "")] <- NA
any(is.na(diab$frame))
## [1] TRUE

b.)

drops <- c("id", "bp.2s", "bp.2d")
diab <- diab[, !(names(diab) %in% drops)]

c.)

str(diab)
## 'data.frame':    403 obs. of  16 variables:
##  $ chol    : int  203 165 228 78 249 248 195 227 177 263 ...
##  $ stab.glu: int  82 97 92 93 90 94 92 75 87 89 ...
##  $ hdl     : int  56 24 37 12 28 69 41 44 49 40 ...
##  $ ratio   : num  3.6 6.9 6.2 6.5 8.9 ...
##  $ glyhb   : num  4.31 4.44 4.64 4.63 7.72 ...
##  $ location: chr  "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
##  $ age     : int  46 29 58 67 64 34 30 37 45 55 ...
##  $ gender  : chr  "female" "female" "female" "male" ...
##  $ height  : int  62 64 61 67 68 71 69 59 69 63 ...
##  $ weight  : int  121 218 256 119 183 190 191 170 166 202 ...
##  $ frame   : chr  "medium" "large" "large" "large" ...
##  $ bp.1s   : int  118 112 190 110 138 132 161 NA 160 108 ...
##  $ bp.1d   : int  59 68 92 50 80 86 112 NA 80 72 ...
##  $ waist   : int  29 46 49 33 44 36 46 34 34 45 ...
##  $ hip     : int  38 48 57 38 41 42 49 39 40 50 ...
##  $ time.ppn: int  720 360 180 480 300 195 720 1020 300 240 ...

The variables chol, stab.glu, hdl, ratio, glyhb, age, height, weight, bp.1s, bp.1d, waist, hip, and time.ppn are quantitative. The variables location, gender, and frame are qualitative.

library(ggplot2)
ggplot(data = diab, aes(x = glyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

The distribution of glyhb is strongly right-skewed, with many values stretching into a long upper tail.

library(tidyr)
ggplot(gather(diab[, c(
  "chol",
  "stab.glu",
  "hdl",
  "ratio"
)], cols, value), aes(x = value)) +
  geom_histogram(binwidth = 20) + facet_grid(. ~ cols)
## Warning: Removed 3 rows containing non-finite values (stat_bin).

ggplot(gather(diab[, c(
  "age",
  "height",
  "weight",
  "bp.1s"
)], cols, value), aes(x = value)) +
  geom_histogram(binwidth = 20) + facet_grid(. ~ cols)
## Warning: Removed 11 rows containing non-finite values (stat_bin).

ggplot(gather(diab[, c(
  "bp.1d",
  "waist",
  "hip",
  "time.ppn"
)], cols, value), aes(x = value)) +
  geom_histogram(binwidth = 30) + facet_grid(. ~ cols)
## Warning: Removed 12 rows containing non-finite values (stat_bin).

pie(table(diab$location))

pie(table(diab$gender))

pie(table(diab$frame))

d.)

diab$logGlyhb <- log(diab$glyhb)
diab$recipGlyhb <- 1 / (diab$glyhb)
diab$sqrtGlyhb <- sqrt(diab$glyhb)

ggplot(data = diab, aes(x = logGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

ggplot(data = diab, aes(x = recipGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

ggplot(data = diab, aes(x = sqrtGlyhb)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 13 rows containing non-finite values (stat_bin).

After trying multiple transformations, the transformation \(\frac{1}{glyhb}\) appears the most normal.
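One way to make "appears the most normal" quantitative is to compare a normality statistic across the candidate transformations. A hedged sketch on simulated right-skewed data (not the diabetes data; the lognormal stand-in means the log transform wins here, which need not match the real data):

```r
# Compare Shapiro-Wilk W (closer to 1 = more normal) across transformations.
set.seed(3)
g <- rlnorm(200, meanlog = 1.6, sdlog = 0.3)   # stand-in for a skewed response
W <- c(
  raw        = unname(shapiro.test(g)$statistic),
  log        = unname(shapiro.test(log(g))$statistic),
  reciprocal = unname(shapiro.test(1 / g)$statistic),
  sqrt       = unname(shapiro.test(sqrt(g))$statistic)
)
round(W, 4)  # the transformation with the largest W looks most normal
```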

diab$"glyhb*" <- diab$recipGlyhb
diab$glyhb <- diab$"glyhb*"

e.)

library(GGally)
library(plotly)
ggplotly(ggpairs(diab[,c("chol", "stab.glu", "hdl", "ratio", "age", "height", "weight", "bp.1s", "bp.1d", "waist", "hip","time.ppn")]))

From the scatterplot matrix, there are no signs of non-linearity in the data amongst the quantitative predictor variables.

f.)

ggplot(data = diab, aes(x = glyhb, color = gender)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

ggplot(data = diab, aes(x = glyhb, color = frame)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

The difference between genders is minimal. The differences in glycosylated hemoglobin across frames are more noticeable: large frames have the most spread and the lowest median, small frames have the highest median, and medium frames have the most outliers.

g.)

set.seed(10)
library(caTools)
sample <- sample.split(diab$glyhb, SplitRatio = 0.5)
train  <- subset(diab, sample == TRUE)
test   <- subset(diab, sample == FALSE)

# simple way is to put labels on both datasets and remerge for comparisons 
train$label <- rep("train", nrow(train))
test$label <- rep("test", nrow(test))

# Combined dataset now has labels for which is in train and which is in test/validation
combin <- rbind(train, test)
combin
ggplot(data = combin, aes(x = glyhb, color = label)) + geom_boxplot()
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).

ggplot(data = combin, aes(x = stab.glu, color = label)) + geom_boxplot()

ggplot(data = combin, aes(x = ratio, color = label)) + geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

ggplot(data = combin, aes(x = age, color = label)) + geom_boxplot()

ggplot(data = combin, aes(x = bp.1s, color = label)) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

ggplot(data = combin, aes(x = waist, color = label)) + geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

We can see from the boxplots that the difference between the training and the validation data sets is very minimal. All these variables have similar distributions.

Question 4

a.)

diab <- diab[,c(-17, -18, -19, -20)]
diab
# Fitting a first order model
# lm(glyhb ~ chol + stab.glu + hdl + ratio + factor(location) + age + factor(gender) + height)
model1 <- lm(glyhb~., diab)
anova(model1)

There are 16 regression coefficients (17 including the intercept): the model has 15 predictors, with frame contributing two dummy variables. The \(MSE = 0.00133\).
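As a general pattern, the MSE of an lm fit can also be read off without scanning the ANOVA table. A minimal sketch on simulated data (names illustrative):

```r
# Two equivalent ways to get MSE = SSE / df.residual from an lm fit.
set.seed(5)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + d$x1 - d$x2 + rnorm(50, sd = 0.4)
fit <- lm(y ~ ., data = d)

mse1 <- sum(residuals(fit)^2) / df.residual(fit)
mse2 <- summary(fit)$sigma^2   # residual standard error, squared
all.equal(mse1, mse2)          # TRUE
```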

library(MASS)
boxcox(model1)

The Box-Cox method shows that no further transformation of the response variable is needed, since \(\lambda = 1\) falls within the confidence interval.
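To read the suggested power off the Box-Cox output numerically rather than from the plot, one can extract the \(\lambda\) that maximizes the profile log-likelihood. A sketch on simulated data (here the response is built so a log transform is indicated, i.e. \(\hat\lambda \approx 0\); for the diabetes model above, "no transformation needed" corresponds to \(\hat\lambda \approx 1\)):

```r
library(MASS)  # for boxcox()
set.seed(9)
x <- runif(100)
y <- exp(1 + 2 * x + rnorm(100, sd = 0.3))   # log(y) is linear in x
bc <- boxcox(lm(y ~ x), plotit = FALSE)       # grid of lambda and log-likelihoods
lambda.hat <- bc$x[which.max(bc$y)]           # lambda with the highest likelihood
lambda.hat  # near 0 here, suggesting a log transformation
```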

b.)

library(leaps)
reg <- regsubsets(glyhb~., data = diab, nvmax = 16)
sumReg <- summary(reg)
sumReg
## Subset selection object
## Call: regsubsets.formula(glyhb ~ ., data = diab, nvmax = 16)
## 16 Variables  (and intercept)
##                Forced in Forced out
## chol               FALSE      FALSE
## stab.glu           FALSE      FALSE
## hdl                FALSE      FALSE
## ratio              FALSE      FALSE
## locationLouisa     FALSE      FALSE
## age                FALSE      FALSE
## gendermale         FALSE      FALSE
## height             FALSE      FALSE
## weight             FALSE      FALSE
## framemedium        FALSE      FALSE
## framesmall         FALSE      FALSE
## bp.1s              FALSE      FALSE
## bp.1d              FALSE      FALSE
## waist              FALSE      FALSE
## hip                FALSE      FALSE
## time.ppn           FALSE      FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: exhaustive
##           chol stab.glu hdl ratio locationLouisa age gendermale height weight
## 1  ( 1 )  " "  "*"      " " " "   " "            " " " "        " "    " "   
## 2  ( 1 )  " "  "*"      " " " "   " "            "*" " "        " "    " "   
## 3  ( 1 )  " "  "*"      " " "*"   " "            "*" " "        " "    " "   
## 4  ( 1 )  " "  "*"      " " "*"   " "            "*" " "        " "    " "   
## 5  ( 1 )  " "  "*"      " " "*"   " "            "*" " "        " "    " "   
## 6  ( 1 )  " "  "*"      " " "*"   "*"            "*" " "        " "    " "   
## 7  ( 1 )  "*"  "*"      " " "*"   "*"            "*" " "        " "    " "   
## 8  ( 1 )  "*"  "*"      " " "*"   "*"            "*" " "        " "    " "   
## 9  ( 1 )  "*"  "*"      " " "*"   "*"            "*" " "        " "    " "   
## 10  ( 1 ) "*"  "*"      " " "*"   "*"            "*" " "        "*"    "*"   
## 11  ( 1 ) "*"  "*"      " " "*"   "*"            "*" " "        "*"    "*"   
## 12  ( 1 ) "*"  "*"      " " "*"   "*"            "*" " "        "*"    "*"   
## 13  ( 1 ) "*"  "*"      "*" "*"   "*"            "*" " "        "*"    "*"   
## 14  ( 1 ) "*"  "*"      "*" "*"   "*"            "*" " "        "*"    "*"   
## 15  ( 1 ) "*"  "*"      "*" "*"   "*"            "*" " "        "*"    "*"   
## 16  ( 1 ) "*"  "*"      "*" "*"   "*"            "*" "*"        "*"    "*"   
##           framemedium framesmall bp.1s bp.1d waist hip time.ppn
## 1  ( 1 )  " "         " "        " "   " "   " "   " " " "     
## 2  ( 1 )  " "         " "        " "   " "   " "   " " " "     
## 3  ( 1 )  " "         " "        " "   " "   " "   " " " "     
## 4  ( 1 )  " "         " "        " "   " "   "*"   " " " "     
## 5  ( 1 )  " "         " "        " "   " "   "*"   " " "*"     
## 6  ( 1 )  " "         " "        " "   " "   "*"   " " "*"     
## 7  ( 1 )  " "         " "        " "   " "   "*"   " " "*"     
## 8  ( 1 )  "*"         " "        " "   " "   "*"   " " "*"     
## 9  ( 1 )  "*"         " "        "*"   " "   "*"   " " "*"     
## 10  ( 1 ) " "         " "        " "   " "   "*"   "*" "*"     
## 11  ( 1 ) "*"         " "        " "   " "   "*"   "*" "*"     
## 12  ( 1 ) "*"         " "        "*"   " "   "*"   "*" "*"     
## 13  ( 1 ) "*"         " "        "*"   " "   "*"   "*" "*"     
## 14  ( 1 ) "*"         " "        "*"   "*"   "*"   "*" "*"     
## 15  ( 1 ) "*"         "*"        "*"   "*"   "*"   "*" "*"     
## 16  ( 1 ) "*"         "*"        "*"   "*"   "*"   "*" "*"
# Data frame of all the criteria (AIC is not printed by regsubsets)
data.frame(
  SSE = sumReg$rss,
  R2 = sumReg$rsq,
  R2adj = sumReg$adjr2,
  Cp = sumReg$cp,
  BIC = sumReg$bic
)
### Now we find which model had the optimum value 
# SSE 
which.min(sumReg$rss)
## [1] 16
# R2
which.max(sumReg$rsq)
## [1] 16
# R2adj
which.max(sumReg$adjr2)
## [1] 7
# Cp
which.min(sumReg$cp)
## [1] 7
# BIC 
which.min(sumReg$bic)
## [1] 4
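Since regsubsets does not report AIC, it can be computed from the stored \(SSE\) values via \(AIC = n\log(SSE/n) + 2p\), the same scale R's extractAIC() uses for lm fits. A self-contained sketch on simulated data (names illustrative; nested models built by hand rather than via regsubsets):

```r
# Compute AIC from SSE for a sequence of nested models and compare to extractAIC().
set.seed(11)
n <- 120
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)          # x3 is pure noise
fits <- list(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))

sse <- sapply(fits, function(f) sum(residuals(f)^2))
p   <- sapply(fits, function(f) length(coef(f)))   # parameters incl. intercept
aic <- n * log(sse / n) + 2 * p
aic

# Agrees with R's extractAIC() on the same convention:
extractAIC(fits[[2]])[2] - aic[2]   # essentially zero
```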

c.)

diab <- na.omit(diab)
nullModel1 <- lm(glyhb~1, data = diab)
stepAICfor <- stepAIC(
  nullModel1,
  scope = list(upper = "~chol + stab.glu + hdl + ratio + factor(location) + age + factor(gender) +height+weight+factor(frame)+bp.1s+bp.1d+waist+hip+time.ppn",
  lower = ~ 1), direction = "forward", k = 2, trace = FALSE
)

stepAICfor$anova
# Since Cp and AIC tend to select similar models, we pick the model based
# on Cp, which was model 7, and build it here
model7 <- lm(glyhb ~ chol + stab.glu + ratio + factor(location) + age + factor(frame) +
    waist + time.ppn,
  data = diab
)
AIC(model7)
## [1] -1379.226
# Pick the final model from forward selection 
modelfs1 <- lm(stepAICfor$model)

# directly compare the models
model7
## 
## Call:
## lm(formula = glyhb ~ chol + stab.glu + ratio + factor(location) + 
##     age + factor(frame) + waist + time.ppn, data = diab)
## 
## Coefficients:
##            (Intercept)                    chol                stab.glu  
##              3.524e-01              -6.593e-05              -4.968e-04  
##                  ratio  factor(location)Louisa                     age  
##             -2.912e-03               6.846e-03              -6.341e-04  
##    factor(frame)medium      factor(frame)small                   waist  
##             -3.853e-03              -1.350e-03              -1.133e-03  
##               time.ppn  
##             -1.180e-05
modelfs1
## 
## Call:
## lm(formula = stepAICfor$model)
## 
## Coefficients:
##            (Intercept)                stab.glu                     age  
##              3.494e-01              -4.952e-04              -6.216e-04  
##                  ratio                   waist                time.ppn  
##             -2.910e-03              -1.093e-03              -1.177e-05  
## factor(location)Louisa                    chol  
##              6.523e-03              -7.209e-05

We can see that the regsubsets model picks the variables chol, stab.glu, ratio, location, age, frame, waist, and time.ppn. The forward selection picks a model with all of the same variables except frame. Let's see the difference in AICs.

AIC(model7)
## [1] -1379.226
AIC(modelfs1)
## [1] -1382.494

We can see that modelfs1 has the lower AIC of the two.

d.)

plot(modelfs1)

We can see that the Residuals versus Fitted values shows random scatter, with no obvious signs of patterns. From the Normal Q-Q plot, we can see approximate normality in the residuals. Overall, the model appears to be adequate.

Question 5

a.)

model2 <- lm(glyhb ~ .^2, data = diab)
aovModel2 <- anova(model2)

# Get the last value, which is the MSE
tail(aovModel2$`Mean Sq`, n = 1)
## [1] 0.001221709

My immediate concern is that we have an extremely large number of unnecessary predictors, mostly in the form of interaction terms.

b.)

stepAICfor2 <- stepAIC(
  nullModel1,
  scope = list(lower = nullModel1, upper = model2),
  k = 2,
  direction = "forward",
  trace = F
)

modelfs2 <- lm(stepAICfor2$model)

# Compare to first model 
AIC(modelfs1)
## [1] -1382.494
AIC(modelfs2)
## [1] -1395.439

We notice that the model with interaction terms has a lower AIC than the model with no interaction terms.

c.)

plot(modelfs2)

Both the residuals vs. fitted values plot and the normal Q-Q plot show that the model is adequate. There are no patterns in the residuals versus fitted values, and the normal Q-Q plot shows no deviations from approximate normality.

d.)

# We perform forward stepwise procedure using AIC
stepAIC(
  nullModel1,
  scope = list(lower = nullModel1, upper = modelfs2),
  k = 2,
  direction = "forward"
)
## Start:  AIC=-2173.55
## glyhb ~ 1
## 
##            Df Sum of Sq     RSS     AIC
## + stab.glu  1   0.39753 0.56182 -2367.4
## + age       1   0.15016 0.80918 -2233.9
## + ratio     1   0.12108 0.83827 -2220.9
## + waist     1   0.09755 0.86180 -2210.8
## + location  1   0.00655 0.95280 -2174.1
## <none>                  0.95934 -2173.6
## + time.ppn  1   0.00126 0.95809 -2172.0
## 
## Step:  AIC=-2367.39
## glyhb ~ stab.glu
## 
##            Df Sum of Sq     RSS     AIC
## + age       1  0.048669 0.51315 -2398.6
## + waist     1  0.028792 0.53303 -2384.6
## + ratio     1  0.027939 0.53388 -2384.1
## + location  1  0.005821 0.55600 -2369.2
## + time.ppn  1  0.004368 0.55745 -2368.2
## <none>                  0.56182 -2367.4
## 
## Step:  AIC=-2398.55
## glyhb ~ stab.glu + age
## 
##                Df Sum of Sq     RSS     AIC
## + ratio         1 0.0214844 0.49167 -2412.2
## + waist         1 0.0212472 0.49190 -2412.0
## + stab.glu:age  1 0.0090485 0.50410 -2403.1
## + location      1 0.0052488 0.50790 -2400.3
## + time.ppn      1 0.0047858 0.50836 -2400.0
## <none>                      0.51315 -2398.6
## 
## Step:  AIC=-2412.21
## glyhb ~ stab.glu + age + ratio
## 
##                Df Sum of Sq     RSS     AIC
## + waist         1 0.0125184 0.47915 -2419.7
## + stab.glu:age  1 0.0069134 0.48475 -2415.4
## + time.ppn      1 0.0056432 0.48602 -2414.4
## + location      1 0.0053102 0.48636 -2414.2
## <none>                      0.49167 -2412.2
## + age:ratio     1 0.0004177 0.49125 -2410.5
## 
## Step:  AIC=-2419.65
## glyhb ~ stab.glu + age + ratio + waist
## 
##                Df Sum of Sq     RSS     AIC
## + time.ppn      1 0.0064607 0.47269 -2422.6
## + stab.glu:age  1 0.0051982 0.47395 -2421.6
## + location      1 0.0044587 0.47469 -2421.1
## <none>                      0.47915 -2419.7
## + age:ratio     1 0.0008567 0.47829 -2418.3
## 
## Step:  AIC=-2422.62
## glyhb ~ stab.glu + age + ratio + waist + time.ppn
## 
##                     Df Sum of Sq     RSS     AIC
## + stab.glu:time.ppn  1 0.0127963 0.45989 -2430.7
## + stab.glu:age       1 0.0064115 0.46627 -2425.6
## + location           1 0.0031451 0.46954 -2423.1
## <none>                           0.47269 -2422.6
## + age:ratio          1 0.0011950 0.47149 -2421.5
## 
## Step:  AIC=-2430.66
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + stab.glu:time.ppn
## 
##                Df Sum of Sq     RSS     AIC
## + stab.glu:age  1 0.0068912 0.45300 -2434.2
## + location      1 0.0041805 0.45571 -2432.0
## <none>                      0.45989 -2430.7
## + age:ratio     1 0.0016075 0.45828 -2429.9
## 
## Step:  AIC=-2434.19
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + stab.glu:time.ppn + 
##     stab.glu:age
## 
##             Df Sum of Sq     RSS     AIC
## + location   1 0.0045058 0.44849 -2435.8
## + age:ratio  1 0.0033804 0.44962 -2434.9
## <none>                   0.45300 -2434.2
## 
## Step:  AIC=-2435.85
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + location + 
##     stab.glu:time.ppn + stab.glu:age
## 
##             Df Sum of Sq     RSS     AIC
## + age:ratio  1 0.0027568 0.44574 -2436.1
## <none>                   0.44849 -2435.8
## 
## Step:  AIC=-2436.1
## glyhb ~ stab.glu + age + ratio + waist + time.ppn + location + 
##     stab.glu:time.ppn + stab.glu:age + age:ratio
## 
## Call:
## lm(formula = glyhb ~ stab.glu + age + ratio + waist + time.ppn + 
##     location + stab.glu:time.ppn + stab.glu:age + age:ratio, 
##     data = diab)
## 
## Coefficients:
##       (Intercept)           stab.glu                age              ratio  
##         3.355e-01         -7.876e-04         -8.332e-04          2.777e-03  
##             waist           time.ppn     locationLouisa  stab.glu:time.ppn  
##        -1.034e-03          3.580e-05          6.642e-03         -4.619e-07  
##      stab.glu:age          age:ratio  
##         7.341e-06         -1.209e-04

Performing the forward selection procedure based on AIC, we find the best model is modelfs2, confirming that modelfs2 is the best choice for minimizing AIC.